theoretical result
A Derivation of D1 Denote the logit vector as $x$; we have $p_j = e^{x_j}$
Without the zero-mean constraint, training becomes unstable. Following the training setting of [23], the classifier network is trained with SGD with a weight decay of 5e-4, an initial learning rate of 1e-1, and a mini-batch size of 100 for all methods. We use the cosine learning rate decay schedule [49] for a total of 80 epochs. The MLP weighting network is trained with Adam [51] with a fixed learning rate of 1e-3 and a weight decay of 1e-4. We set the outer-level learning rate $\eta_\omega$ as
Figure 7: Training curve without zero-mean constraint on CIFAR10 under 40% uniform noise.
Approximations for the computation of m
Providing a very low critical probability $p_c$ means that certification occurs when the simulation ends after a large number of iterations $m$. We introduce $\ell_c$, the threshold associated with $p_c$ s.t. Table 5 shows that this approximation is excellent even for large $p_c$. This shows that $m$ is a little larger than $m_c = \log(p_c)/\log(1 - 1/N)$. This section assumes that $X = x_o + \sigma X$ with $X \sim \mathcal{N}(0_n; I_n)$ and that $h(x) = x^\top g - \tau$ with $g \in \mathbb{R}^n$ and $\|g\| = 1$ (w.l.o.g.).
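As a rough illustration of the formula $m_c = \log(p_c)/\log(1 - 1/N)$ above, the following sketch evaluates it numerically. The function name and the example values of $p_c$ and $N$ are assumptions chosen for illustration, not settings taken from the experiments.

```python
import math


def certification_iterations(p_c: float, n_particles: int) -> float:
    """Evaluate m_c = log(p_c) / log(1 - 1/N).

    p_c is the critical probability and n_particles is N.
    Both names are illustrative, not taken from the original code.
    """
    if not 0.0 < p_c < 1.0:
        raise ValueError("p_c must lie strictly between 0 and 1")
    if n_particles < 2:
        raise ValueError("N must be at least 2 so that log(1 - 1/N) is finite")
    # log1p(-1/N) computes log(1 - 1/N) accurately when N is large.
    return math.log(p_c) / math.log1p(-1.0 / n_particles)


if __name__ == "__main__":
    # Hypothetical values: a very low critical probability needs many iterations.
    for p_c in (1e-5, 1e-10, 1e-20):
        print(p_c, certification_iterations(p_c, n_particles=1000))
```

For large $N$, $\log(1 - 1/N) \approx -1/N$, so $m_c$ grows roughly like $-N \log p_c$, which is why a very low $p_c$ implies a large number of iterations.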
Supplement to "Uniform Concentration Bounds toward a Unified Framework for Robust Clustering"
For the theoretical exposition, we first establish the following lemmas. Lemma A.1 proves that the derivative of the function $\phi$ is bounded in the $\ell_2$-norm when the domain is restricted to the support of $P$. Lemma A.1. Lemma A.3 proves that the function $f_\Theta$, as a function of $\Theta$, is Lipschitz with respect to the $\|\cdot\|$ norm. Thus, from equation (1),
$$\langle \nabla\phi(P_C(\theta)) - \nabla\phi(\theta),\; x - P_C(\theta) \rangle \ge 0. \tag{2}$$
We now observe that
$$d_\phi(x,\theta) - d_\phi(x,P_C(\theta)) - d_\phi(P_C(\theta),\theta) = \langle \nabla\phi(P_C(\theta)) - \nabla\phi(\theta),\; x - P_C(\theta) \rangle \ge 0.$$
Hence the result.
Joint first authors contributed equally. Corresponding author. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).
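The three-point identity used in the last step can be checked numerically. The sketch below assumes the simplest Bregman generator, $\phi(v) = \tfrac{1}{2}\|v\|_2^2$ (so $d_\phi(x,y) = \tfrac{1}{2}\|x-y\|_2^2$), and takes $C$ to be the unit box $[0,1]^n$ so that the projection is a coordinate-wise clip. These choices, and all function names, are illustrative assumptions rather than the setting of the lemmas.

```python
def phi(v):
    """Assumed generator: phi(v) = 0.5 * ||v||_2^2."""
    return 0.5 * sum(a * a for a in v)


def grad_phi(v):
    """Gradient of the assumed generator, which is the identity map."""
    return list(v)


def bregman(x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    g = grad_phi(y)
    return phi(x) - phi(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))


def project_box(theta, lo=0.0, hi=1.0):
    """Bregman (here Euclidean) projection onto the box [lo, hi]^n."""
    return [min(max(t, lo), hi) for t in theta]


def three_point_sides(x, theta):
    """Return both sides of the identity for a point x in C and any theta."""
    p = project_box(theta)
    lhs = bregman(x, theta) - bregman(x, p) - bregman(p, theta)
    gp, gt = grad_phi(p), grad_phi(theta)
    rhs = sum((a - b) * (xi - pi) for a, b, xi, pi in zip(gp, gt, x, p))
    return lhs, rhs


if __name__ == "__main__":
    x = [0.2, 0.9, 0.5]        # a point inside the box C
    theta = [1.7, -0.4, 0.3]   # a point outside C
    lhs, rhs = three_point_sides(x, theta)
    print(lhs, rhs)            # the two sides agree and are non-negative
```

The equality holds for any three points; the non-negativity is what relies on $P_C(\theta)$ being the projection of $\theta$ onto a convex set containing $x$.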
lower bound
While there remains a small gap between our main lower bound of Theorem 3 and the deterministic quantised gradient descent of Section 6, we can show that the gap cannot be closed by improved deterministic algorithms where the coordinator learns the value of the objective function $F(x)$ in addition to the minimiser $x$. That is, our quantised gradient descent is the communication-optimal deterministic algorithm for variant (1) for objectives with constant condition number. Recall that in the $N$-player equality over a universe of size $d$, denoted by $\mathrm{EQ}_{d,N}$, each player $i$ is given an input $b_i \in \{0,1\}^d$, and the task is to decide if all players have the same input. It is known [33] that the deterministic communication complexity of $\mathrm{EQ}_{d,N}$ is $\mathrm{CC}(\mathrm{EQ}_{d,N}) = \Omega(Nd)$.
Theorem 8. Given parameters $N$, $d$, $\varepsilon$, $\beta_0$ and $\beta = \beta_0 N$ satisfying $d\beta/\varepsilon = \Omega(1)$, any deterministic protocol solving (1) for quadratic input functions $x \mapsto \beta_0 \|x - x_0\|_2^2$ has communication complexity $\Omega(Nd \log(\beta d/\varepsilon))$, if the coordinator is also required to output an estimate $r \in \mathbb{R}$ for the minimum function value such that
Assume $\Pi$ is a deterministic protocol solving (1) with communication complexity $C$. We show that $\Pi$ can then solve $N$-party equality over a universe of size $D = \Omega(d \log(\beta d/\varepsilon))$, implying $C = \Omega(ND) = \Omega(Nd \log(\beta d/\varepsilon))$. More specifically, let $S$ be the set given by Lemma 2 with $\delta = (2\varepsilon/\beta)^{1/2}$, and let $D = \lceil \log |S| \rceil = \Omega(d \log(\beta d/\varepsilon))$. Note that since we assume $d\beta/\varepsilon = \Omega(1)$, the set $S$ has at least two elements and $D \ge 1$.
Supplementary Materials: An Empirical Study of Adder Neural Networks for Object Detection
As discussed in prior literature [1, 4], a floating-point addition and a floating-point multiplication have energy costs of 0.9 pJ and 3.7 pJ, respectively. Meanwhile, an 8-bit integer addition and an 8-bit integer multiplication cost 0.03 pJ and 0.2 pJ, respectively, which is much lower than the floating-point operations. Therefore, it is important to explore whether adder detectors perform well under INT8 quantization. We applied INT8 post-training quantization to our Adder FCOS (B+N) model, which suffers a 0.8 mAP drop compared with the full-precision model, as shown in Table A. The energy reduction further increases from 29% to 35%. Note that post-training quantization is not optimal for INT8 models, and quantization-aware training may further improve the accuracy.
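To make the per-operation costs above concrete, the following sketch totals the energy of a layer from its addition and multiplication counts. The per-operation figures are the ones quoted above; the function name, the `"fp"`/`"int8"` labels, and the operation counts in the example are hypothetical and are not measurements from Table A.

```python
# Per-operation energy in picojoules, as quoted in the text above.
ENERGY_PJ = {
    ("fp", "add"): 0.9,
    ("fp", "mul"): 3.7,
    ("int8", "add"): 0.03,
    ("int8", "mul"): 0.2,
}


def energy_pj(n_add: int, n_mul: int, precision: str) -> float:
    """Total energy (pJ) for n_add additions and n_mul multiplications."""
    if precision not in ("fp", "int8"):
        raise ValueError("precision must be 'fp' or 'int8'")
    return (n_add * ENERGY_PJ[(precision, "add")]
            + n_mul * ENERGY_PJ[(precision, "mul")])


if __name__ == "__main__":
    # Hypothetical layer with one million multiply-accumulate steps.
    macs = 1_000_000
    # A multiplication-based layer uses one multiply and one add per step.
    conv_fp = energy_pj(n_add=macs, n_mul=macs, precision="fp")
    # An adder-style layer is assumed here to use two additions per step.
    adder_int8 = energy_pj(n_add=2 * macs, n_mul=0, precision="int8")
    print(conv_fp, adder_int8)
```

The two-additions-per-step count for the adder layer is a simplifying assumption; the actual operation mix of a detector depends on which layers are replaced.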
EasyToHard
A.1 Datasets
Details of the datasets we introduce are presented in this section. Specific details about generation as well as statistics from the resulting datasets are delineated for each one below.
A.1.1 Prefix sum data
Binary string inputs of length $n$ are generated by selecting a random integer in $[0, 2^n)$ and expressing its binary representation with $n$ digits. Datasets are produced by repeating this random process 10,000 times without replacement. Because the number of possible points increases exponentially as a function of $n$ and the size of the generated dataset is fixed, it is important to note that the dataset becomes sparser in its ambient hypercube as $n$ increases.
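The input-generation procedure described above can be sketched as follows. This is a minimal illustration, not the released generation script: the function names, the seed handling, and the density helper are assumptions, and only the inputs (not the targets) are produced.

```python
import random


def generate_binary_strings(n: int, num_samples: int = 10_000, seed: int = 0):
    """Sample distinct integers from [0, 2^n) and format each with n binary digits."""
    if num_samples > 2 ** n:
        raise ValueError("cannot draw more distinct strings than 2^n")
    rng = random.Random(seed)
    # Sampling without replacement guarantees that no input is repeated.
    values = rng.sample(range(2 ** n), num_samples)
    return [format(v, f"0{n}b") for v in values]


def density(n: int, num_samples: int = 10_000) -> float:
    """Fraction of the n-dimensional hypercube covered by the dataset."""
    return num_samples / 2 ** n


if __name__ == "__main__":
    data = generate_binary_strings(n=16)
    print(data[:3], len(data))
    # The dataset size is fixed, so coverage shrinks exponentially with n.
    for n in (16, 24, 32):
        print(n, density(n))
```

Note that drawing 10,000 distinct strings requires $2^n \ge 10{,}000$, that is, $n \ge 14$.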